Linear Regression - Interpreting the result

In this notebook we use linear regression to predict the coefficients corresponding to the top eigenvectors of the measurements:

  • TAVG: The average temperature for day/location. (TMAX + TMIN)/2
  • TRANGE: The temperature range between the highest and lowest temperatures of the day. TMAX-TMIN.
  • SNWD: The depth of the accumulated snow.

For each of these three measurements we keep the coefficients of the top 3 eigenvectors, which gives 3 × 3 = 9 output variables that we aim to predict.
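To make "coefficients corresponding to the top eigenvectors" concrete, here is a minimal sketch on synthetic data (not the weather data). It assumes the coefficients are plain projections of each station's measurement vector onto the top 3 eigenvectors of the covariance matrix. The names `measurements`, `top_vecs` and `coeffs` are made up for illustration, and the notebook that produced the real coefficients may center or normalize the data differently.

```python
import numpy as np

rng = np.random.RandomState(0)
# Synthetic stand-in: 200 "stations", each with a 365-day measurement vector.
measurements = rng.randn(200, 365)

# Covariance of the 365 daily values, estimated across stations.
cov = np.cov(measurements, rowvar=False)

# eigh returns eigenvalues in ascending order, so sort to get the largest first.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
top_vecs = eigvecs[:, order[:3]]          # shape (365, 3)

# One row of 3 coefficients per station, analogous to TAVG_coeff1..3.
coeffs = measurements.dot(top_vecs)       # shape (200, 3)
print(coeffs.shape)
```

Each row of `coeffs` plays the role of one station's list in a column such as `TAVG_coeff`.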

The 4 input variables we use for the regression are properties of the location of the station:

  • latitude, longitude: location of the station.
  • elevation: the elevation of the location above sea level.
  • dist_coast: the distance of the station from the coast (in kilometers).

Read and parse the data


In [65]:
import pickle
import pandas as pd
!ls *.pickle  # check


stations_projections.pickle

In [66]:
!curl -o "stations_projections.pickle" "http://mas-dse-open.s3.amazonaws.com/Weather/stations_projections.pickle"


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2750k  100 2750k    0     0  1100k      0  0:00:02  0:00:02 --:--:-- 1100k

In [67]:
data = pickle.load(open("stations_projections.pickle",'rb'))  # pickle files must be opened in binary mode
data.shape


Out[67]:
(12140, 8)

In [68]:
data.head(1)


Out[68]:
station TAVG_coeff TRANGE_coeff SNWD_coeff latitude longitude elevation dist_coast
0 USC00044534 [3047.96236332, 1974.34852034, 150.560792408] [-2903.63287861, -236.907267527, 147.021790682] [0.19150300062, 0.187262808215, -0.0401379552536] 36.0042 -119.96 73.2 107.655

In [69]:
# break up the lists of coefficients into separate columns
for col in [u'TAVG_coeff', u'TRANGE_coeff', u'SNWD_coeff']:
    for i in range(3):
        new_col=col+str(i+1)
        data[new_col]=[e[i] for e in list(data[col])]
    data.drop(labels=col,axis=1,inplace=True)
data.drop(labels='station',axis=1,inplace=True)
print(data.columns)
data.head(3)


Index([     u'latitude',     u'longitude',     u'elevation',    u'dist_coast',
         u'TAVG_coeff1',   u'TAVG_coeff2',   u'TAVG_coeff3', u'TRANGE_coeff1',
       u'TRANGE_coeff2', u'TRANGE_coeff3',   u'SNWD_coeff1',   u'SNWD_coeff2',
         u'SNWD_coeff3'],
      dtype='object')
Out[69]:
latitude longitude elevation dist_coast TAVG_coeff1 TAVG_coeff2 TAVG_coeff3 TRANGE_coeff1 TRANGE_coeff2 TRANGE_coeff3 SNWD_coeff1 SNWD_coeff2 SNWD_coeff3
0 36.0042 -119.9600 73.2 107.65500 3047.962363 1974.348520 150.560792 -2903.632879 -236.907268 147.021791 0.191503 0.187263 -0.040138
1 42.7519 -124.5011 12.8 0.61097 2072.149003 880.454659 -19.403966 -1588.344065 22.091593 53.905710 0.315438 0.126292 0.792079
2 47.1064 -104.7183 632.8 1316.54000 949.764151 2361.836952 132.430209 -2802.638187 -165.774139 152.216161 745.947252 256.091735 113.675894

Performing and evaluating the regression

As the size of the data is modest, we can perform the regression using regular Python (not Spark) running on a laptop. We use the library sklearn.


In [70]:
from sklearn.linear_model import LinearRegression

Coefficient of determination

Computed by calling the method LinearRegression.score()

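The regression score goes by several names: "coefficient of determination", $R^2$, "R squared score" and "percentage of variance explained". It should not be confused with the correlation coefficient: for a least-squares fit with an intercept, $R^2$ on the training data equals the square of the correlation between $y$ and the fitted predictions. It is explained in more detail in Wikipedia.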

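Roughly speaking, the $R^2$-score measures the fraction of the variance of the regression output variable that is explained by the prediction function. On the training data, the score of a least-squares fit with an intercept lies between 0 and 1. A score of 1 means that the regression function perfectly predicts the value of $y$. A score of 0 means that it does no better than always predicting the mean of $y$. On a test set the score can even be negative, which means that the model predicts worse than the mean.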
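As a small check of this definition, the sketch below computes $R^2 = 1 - \sum_i (y_i - \hat{y}_i)^2 / \sum_i (y_i - \bar{y})^2$ by hand and compares it with `LinearRegression.score()`. The data is synthetic and the variable names are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.randn(100, 2)
# Synthetic target: linear in X plus noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.randn(100)

model = LinearRegression().fit(X, y)
pred = model.predict(X)

ss_res = np.sum((y - pred) ** 2)          # variation left unexplained by the model
ss_tot = np.sum((y - y.mean()) ** 2)      # total variation around the mean of y
r2_manual = 1.0 - ss_res / ss_tot

print("manual: %.6f  score(): %.6f" % (r2_manual, model.score(X, y)))
```

The two numbers should agree up to floating point error.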

Training score vs Test score

Suppose we fit a regression function with 10 features to 10 data points. We are very likely to fit the data perfectly and get a score of 1. However, this does not mean that our model truly explains the data. It only means that the number of training examples is too small relative to the number of features. To detect this situation, we take the model that was fit to the training set and compute its score on a separate test set. If the ratio between the test score and the training score is smaller than, say, 0.1, then our regression function probably over-fits the data.
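The 10-features, 10-points scenario can be reproduced with a short sketch. It uses `numpy.linalg.lstsq` rather than sklearn to stay self-contained, and the targets are pure noise, so any apparent fit is over-fitting. The helpers `r2` and `add_intercept` are hypothetical names introduced here.

```python
import numpy as np

def r2(y, pred):
    """Coefficient of determination of predictions `pred` for targets `y`."""
    return 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

def add_intercept(X):
    """Append a column of ones so the fit includes a constant term."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

rng = np.random.RandomState(0)
n_features = 10
# The targets are pure noise: there is nothing real to learn.
X_train, y_train = rng.randn(10, n_features), rng.randn(10)
X_test, y_test = rng.randn(1000, n_features), rng.randn(1000)

# Least-squares fit with as many features as training points.
w = np.linalg.lstsq(add_intercept(X_train), y_train, rcond=None)[0]

train_score = r2(y_train, add_intercept(X_train).dot(w))
test_score = r2(y_test, add_intercept(X_test).dot(w))
print("Train: %.3f  Test: %.3f" % (train_score, test_score))
```

The training score is essentially 1 while the test score is far lower, which is the signature of over-fitting described above.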

Finding the importance of input variables

The fact that a regression coefficient is far from zero provides some indication that the corresponding input variable is important. However, the size of a coefficient also depends on the scaling of its variable. A much more reliable way to find out which of the input variables are important is to compare the score of the regression function that uses all of the input variables to the score when one of the variables is eliminated. This is sometimes called "sensitivity analysis".
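The dependence of coefficient size on scaling can be seen in a small sketch on made-up data. Expressing elevation in kilometers instead of meters multiplies its coefficient by 1000, while the $R^2$ score stays the same. The helper `fit_and_score` is a hypothetical name, and the fit uses `numpy.linalg.lstsq` in place of sklearn.

```python
import numpy as np

def fit_and_score(X, y):
    """Least-squares fit with intercept; returns (coefficients, R^2)."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    w = np.linalg.lstsq(A, y, rcond=None)[0]
    pred = A.dot(w)
    r2 = 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
    return w[:-1], r2

rng = np.random.RandomState(0)
# Made-up inputs: latitude in degrees, elevation in meters.
latitude = rng.uniform(25, 50, 500)
elevation_m = rng.uniform(0, 3000, 500)
y = -150.0 * latitude - 0.7 * elevation_m + rng.randn(500) * 100

coef_m, r2_m = fit_and_score(np.column_stack([latitude, elevation_m]), y)
# Same data, with elevation expressed in kilometers instead of meters.
coef_km, r2_km = fit_and_score(np.column_stack([latitude, elevation_m / 1000.0]), y)

print("elevation coef (m):  %.3f  R^2: %.4f" % (coef_m[1], r2_m))
print("elevation coef (km): %.3f  R^2: %.4f" % (coef_km[1], r2_km))
```

Because the score does not change under rescaling, score decreases are comparable across inputs in a way that raw coefficients are not.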


In [86]:
# Compute the R^2 score of the full model and the score decrease when each input is removed
def compute_scores(y_label,X_Train,y_Train,X_test,Y_test):
    # y_label is not used here; it is kept so that existing calls still work
    lg = LinearRegression()
    lg.fit(X_Train,y_Train)

    train_score = lg.score(X_Train,y_Train)
    test_score = lg.score(X_test,Y_test)
    print('R-squared(Coeff. of determination): Train:%.3f, Test:%.3f, Ratio:%.3f\n' % (train_score,test_score,(test_score/train_score)))

    n_inputs = X_Train.shape[1]
    for i in range(n_inputs):
        # indices of all input columns except column i
        L=[j for j in range(n_inputs) if j!=i]
        r_train_X=X_Train[:,L]
        r_test_X=X_test[:,L]

        # refit without column i and measure how much the score drops
        lg = LinearRegression()
        lg.fit(r_train_X,y_Train)
        r_train_score = lg.score(r_train_X,y_Train)
        r_test_score  = lg.score(r_test_X,Y_test)
        # data.columns[i] is the right name only because the inputs are the first columns of data
        print("removed %s Score decrease: \tTrain:%5.3f \tTest: %5.3f " % (data.columns[i], train_score-r_train_score, test_score-r_test_score))

Partition into training set and test set

By dividing the data into two parts, we can detect when our model over-fits. When over-fitting happens, the score on the test set is much smaller than the score on the training set.



In [87]:
from numpy.random import rand
N=data.shape[0]
train_i = rand(N)>0.5   # boolean mask: roughly half of the rows go to the training set
Train = data.loc[train_i,:]
Test  = data.loc[~train_i,:]
print("%s %s %s" % (data.shape,Train.shape,Test.shape))


(12140, 13) (6110, 13) (6030, 13)
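The split above is random and unseeded, so the partition (and therefore every score) changes slightly from run to run. This is why the numbers quoted in the Interpretation section differ a little from the output of this run. A minimal sketch of a reproducible split, assuming a fixed seed is acceptable; the seed value 42 is arbitrary:

```python
import numpy as np

N = 12140                                # number of rows in `data`
rng = np.random.RandomState(42)          # hypothetical seed; any fixed integer works
train_i = rng.rand(N) > 0.5              # boolean mask, True = training row

# The mask can be used to select rows of `data` in the same way as the unseeded one.
print("train: %d  test: %d" % (train_i.sum(), (~train_i).sum()))
```

Re-running with the same seed gives the same mask, so results can be compared across sessions.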

In [88]:
print(Train.iloc[:,:4].head())


   latitude  longitude  elevation  dist_coast
0   36.0042  -119.9600       73.2   107.65500
1   42.7519  -124.5011       12.8     0.61097
2   47.1064  -104.7183      632.8  1316.54000
3   41.7500   -84.2167      247.2   685.50100
6   43.5167  -104.3333     1250.9  1462.50000

In [98]:
from matplotlib import pyplot as plt
%matplotlib inline
def plot_regressions(X_test, y_test, clf):
    # scatter the test points and overlay the fitted line (assumes a single input feature)
    print(X_test.shape)
    print(y_test.shape)
    plt.scatter(X_test, y_test,  color='black')
    plt.plot(X_test, clf.predict(X_test), color='blue',linewidth=3)

In [102]:
# the first 4 columns of data are the input variables
train_X = Train.iloc[:,:4].values
test_X=Test.iloc[:,:4].values
input_names=list(data.columns[:4])

for target in ["TAVG","TRANGE","SNWD"]:
    for j in range(1,4):
        y_label = target+"_coeff"+str(j)
        train_y = Train[y_label]
        test_y = Test[y_label]
        lg = LinearRegression()
        lg.fit(train_X,train_y)

        print("\nTarget variable:  %s %s" % (y_label, '#'*40))
        print("Coeffs:  " +
            ' '.join(['%s:%5.2f ' % (input_names[i],lg.coef_[i]) for i in range(len(lg.coef_))]))

        compute_scores(y_label, train_X, train_y, test_X, test_y)


Target variable:  TAVG_coeff1 ########################################
Coeffs:  latitude:-153.41  longitude:-19.60  elevation:-0.69  dist_coast:-0.13 
R-squared(Coeff. of determination): Train:0.932, Test:0.930, Ratio:0.998

removed latitude Score decrease: 	Train:0.608 	Test: 0.618 
removed longitude Score decrease: 	Train:0.071 	Test: 0.062 
removed elevation Score decrease: 	Train:0.132 	Test: 0.116 
removed dist_coast Score decrease: 	Train:0.003 	Test: 0.003 

Target variable:  TAVG_coeff2 ########################################
Coeffs:  latitude:-5.46  longitude: 7.29  elevation:-0.15  dist_coast: 0.48 
R-squared(Coeff. of determination): Train:0.603, Test:0.584, Ratio:0.969

removed latitude Score decrease: 	Train:0.009 	Test: 0.004 
removed longitude Score decrease: 	Train:0.114 	Test: 0.117 
removed elevation Score decrease: 	Train:0.077 	Test: 0.057 
removed dist_coast Score decrease: 	Train:0.391 	Test: 0.380 

Target variable:  TAVG_coeff3 ########################################
Coeffs:  latitude:-4.02  longitude:-2.85  elevation: 0.01  dist_coast: 0.07 
R-squared(Coeff. of determination): Train:0.424, Test:0.392, Ratio:0.925

removed latitude Score decrease: 	Train:0.047 	Test: 0.052 
removed longitude Score decrease: 	Train:0.170 	Test: 0.141 
removed elevation Score decrease: 	Train:0.001 	Test: 0.002 
removed dist_coast Score decrease: 	Train:0.088 	Test: 0.089 

Target variable:  TRANGE_coeff1 ########################################
Coeffs:  latitude:23.62  longitude: 9.07  elevation:-0.34  dist_coast:-0.15 
R-squared(Coeff. of determination): Train:0.474, Test:0.440, Ratio:0.928

removed latitude Score decrease: 	Train:0.052 	Test: 0.055 
removed longitude Score decrease: 	Train:0.055 	Test: 0.044 
removed elevation Score decrease: 	Train:0.120 	Test: 0.120 
removed dist_coast Score decrease: 	Train:0.013 	Test: 0.014 

Target variable:  TRANGE_coeff2 ########################################
Coeffs:  latitude:-32.56  longitude: 6.14  elevation:-0.01  dist_coast: 0.14 
R-squared(Coeff. of determination): Train:0.662, Test:0.629, Ratio:0.950

removed latitude Score decrease: 	Train:0.467 	Test: 0.449 
removed longitude Score decrease: 	Train:0.119 	Test: 0.097 
removed elevation Score decrease: 	Train:0.001 	Test: 0.001 
removed dist_coast Score decrease: 	Train:0.049 	Test: 0.041 

Target variable:  TRANGE_coeff3 ########################################
Coeffs:  latitude: 3.92  longitude: 1.44  elevation: 0.04  dist_coast:-0.04 
R-squared(Coeff. of determination): Train:0.121, Test:0.072, Ratio:0.590

removed latitude Score decrease: 	Train:0.055 	Test: 0.027 
removed longitude Score decrease: 	Train:0.053 	Test: 0.038 
removed elevation Score decrease: 	Train:0.055 	Test: 0.036 
removed dist_coast Score decrease: 	Train:0.029 	Test: 0.016 

Target variable:  SNWD_coeff1 ########################################
Coeffs:  latitude:150.51  longitude:22.40  elevation: 1.15  dist_coast:-0.90 
R-squared(Coeff. of determination): Train:0.242, Test:0.229, Ratio:0.947

removed latitude Score decrease: 	Train:0.155 	Test: 0.154 
removed longitude Score decrease: 	Train:0.025 	Test: 0.024 
removed elevation Score decrease: 	Train:0.098 	Test: 0.090 
removed dist_coast Score decrease: 	Train:0.032 	Test: 0.032 

Target variable:  SNWD_coeff2 ########################################
Coeffs:  latitude: 1.51  longitude:-1.09  elevation:-0.22  dist_coast: 0.24 
R-squared(Coeff. of determination): Train:0.068, Test:0.061, Ratio:0.899

removed latitude Score decrease: 	Train:0.000 	Test: -0.000 
removed longitude Score decrease: 	Train:0.001 	Test: 0.001 
removed elevation Score decrease: 	Train:0.048 	Test: 0.045 
removed dist_coast Score decrease: 	Train:0.032 	Test: 0.027 

Target variable:  SNWD_coeff3 ########################################
Coeffs:  latitude: 8.29  longitude: 0.27  elevation: 0.09  dist_coast: 0.01 
R-squared(Coeff. of determination): Train:0.159, Test:0.113, Ratio:0.713

removed latitude Score decrease: 	Train:0.047 	Test: 0.034 
removed longitude Score decrease: 	Train:0.000 	Test: 0.001 
removed elevation Score decrease: 	Train:0.055 	Test: 0.044 
removed dist_coast Score decrease: 	Train:0.001 	Test: 0.000 

Interpretation

When we find a statistically significant coefficient, we want to find a rational explanation for the significance and for the sign of the corresponding coefficient. Please write a one line explanation for each of the following nine input/output pairs (the ones that are numbered).
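One way to read the sign and size of the coefficients is to see how much each input contributes to the predicted difference between two stations. The sketch below uses the rounded TAVG_coeff1 coefficients printed in the run above and the input values of rows 0 and 2 shown by `data.head(3)`. Because the coefficients are rounded to two decimals, the result is approximate.

```python
# Rounded coefficients printed for TAVG_coeff1 in the run above.
coef = {"latitude": -153.41, "longitude": -19.60,
        "elevation": -0.69, "dist_coast": -0.13}

# Input values of rows 0 and 2 from data.head(3).
station_0 = {"latitude": 36.0042, "longitude": -119.9600,
             "elevation": 73.2, "dist_coast": 107.655}
station_2 = {"latitude": 47.1064, "longitude": -104.7183,
             "elevation": 632.8, "dist_coast": 1316.54}

# Contribution of each input to the predicted difference (row 2 minus row 0).
# The intercept cancels in the difference, so it is not needed.
contrib = {name: coef[name] * (station_2[name] - station_0[name]) for name in coef}
predicted_diff = sum(contrib.values())
actual_diff = 949.764151 - 3047.962363   # TAVG_coeff1 of row 2 minus row 0

for name in ["latitude", "longitude", "elevation", "dist_coast"]:
    print("%-10s %9.1f" % (name, contrib[name]))
print("predicted difference: %.1f  actual difference: %.1f" % (predicted_diff, actual_diff))
```

Latitude accounts for by far the largest share of the predicted difference. This agrees with the large score decrease when latitude is removed, and the negative sign says that TAVG_coeff1 drops as we move north.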

Target variable:  TAVG_coeff1 ########################################
Coeffs:  latitude:-153.98  longitude:-19.21  elevation:-0.68  dist_coast:-0.13 
R-squared(Coeff. of determination): Train:0.931, Test:0.931

1. removed latitude Score decrease:     Train:0.613     Test: 0.612 
* Removing latitude had the largest effect on the accuracy of the prediction of TAVG_coeff1, so latitude is the most important input for this target. Its weight is also strongly negative relative to the other coefficients.

2. removed elevation Score decrease:    Train:0.128     Test: 0.121 
* This feature of TAVG is highly dependent on elevation.

Target variable:  TAVG_coeff2 ########################################
Coeffs:  latitude:-5.33  longitude: 7.46  elevation:-0.14  dist_coast: 0.48 
R-squared(Coeff. of determination): Train:0.603, Test:0.585

3. removed longitude Score decrease:    Train:0.115     Test: 0.116 
4. removed dist_coast Score decrease:   Train:0.393     Test: 0.378 

Target variable:  TAVG_coeff3 ########################################
Coeffs:  latitude:-4.19  longitude:-2.64  elevation: 0.01  dist_coast: 0.07 
R-squared(Coeff. of determination): Train:0.420, Test:0.398

5. removed longitude Score decrease:    Train:0.148     Test: 0.164 
6. removed dist_coast Score decrease:   Train:0.095     Test: 0.082 

Target variable:  TRANGE_coeff1 ########################################
Coeffs:  latitude:25.00  longitude: 8.63  elevation:-0.36  dist_coast:-0.15 
R-squared(Coeff. of determination): Train:0.478, Test:0.435

7. removed elevation Score decrease:    Train:0.127     Test: 0.113 

Target variable:  TRANGE_coeff2 ########################################
Coeffs:  latitude:-32.63  longitude: 6.04  elevation:-0.02  dist_coast: 0.14 
R-squared(Coeff. of determination): Train:0.649, Test:0.642

8. removed latitude Score decrease:     Train:0.461     Test: 0.454 

Target variable:  SNWD_coeff1 ########################################
Coeffs:  latitude:147.72  longitude:21.54  elevation: 1.09  dist_coast:-0.88 
R-squared(Coeff. of determination): Train:0.232, Test:0.238

9. removed latitude Score decrease:     Train:0.153     Test: 0.155

Write your answers here

Consult the plots of the eigen-vectors. SNWD is available in an earlier notebook. The statistics for TRANGE and TAVG are in the file http://mas-dse-open.s3.amazonaws.com/Weather/STAT_TAVG_RANGE.pickle

For each of the following eigen-vectors, give a short verbal description

  1. TAVG_coeff1: Avg. Temp. across the year
  2. TAVG_coeff2: Summer & winter temperature diff.
  3. TAVG_coeff3: Fall & winter temp. diff.
  4. TRANGE_coeff1: Summer & winter avg daily temp range diff
  5. TRANGE_coeff2: Summer & winter temp change diff
  6. SNWD_coeff1: Average snow depth (winter)
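To see how a description such as "average across the year" or "summer & winter difference" relates to a coefficient, the sketch below projects two synthetic yearly temperature curves onto two idealized shapes: a flat vector and a one-year cosine. These shapes are assumptions chosen for illustration, not the real eigenvectors, which should be read off the plots. The function `synthetic_year` and the numbers passed to it are made up.

```python
import numpy as np

days = np.arange(365)
# Idealized unit-length shapes (assumptions, not the real eigenvectors).
flat = np.ones(365) / np.sqrt(365.0)                     # "average over the year"
season = np.cos(2 * np.pi * (days - 200) / 365.0)        # warmest around day 200
season /= np.linalg.norm(season)

def synthetic_year(mean, swing):
    """Daily temperatures with a given annual mean and summer/winter swing."""
    return mean + swing * np.cos(2 * np.pi * (days - 200) / 365.0)

coastal = synthetic_year(mean=150.0, swing=40.0)      # mild, small seasonal swing
continental = synthetic_year(mean=50.0, swing=150.0)  # colder, large seasonal swing

for name, year in [("coastal", coastal), ("continental", continental)]:
    print("%-12s flat coeff: %8.1f  seasonal coeff: %8.1f"
          % (name, year.dot(flat), year.dot(season)))
```

The coefficient on the flat shape tracks the annual mean, and the coefficient on the seasonal shape tracks the size of the summer/winter swing. This is the kind of reading the descriptions above give to the first and second TAVG eigenvectors.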

Once you have given a meaning to each of these eigen-vectors, explain the relation to the input variable. Short explanations are better than long ones.

  1. Average temperature increases as you go south.
  2. Average temperature increases as elevation decreases.
  3. The difference between summer and winter temperature changes as you go east.
  4. The difference between summer and winter temperature is a function of distance from the coast.
  5. The far east and west sides have a higher difference in temperature between fall and winter compared to the central parts.
  6. The temperature variance per day increases as a function of distance to the coast.
  7. The difference in average daily temperature range between summer and winter increases as we move lower in elevation.
  8. The difference in temperature range between summer and winter increases as we move further south.
  9. Locations with higher latitude (northern and central parts) get more snow in the winter time compared to the western parts.